Add ScalarValue::RunEndEncoded variant#19895

Merged
alamb merged 8 commits into apache:main from Jefffrey:ree-scalarvalue on Feb 2, 2026

@Jefffrey Jefffrey commented Jan 20, 2026

Which issue does this PR close?

  • Closes #18563

Rationale for this change

Support RunEndEncoded scalar values, similar to how we support Dictionary.

What changes are included in this PR?

  • Add new ScalarValue::RunEndEncoded enum variant
  • Fix ScalarValue::new_default to support Decimal32 and Decimal64
  • Support RunEndEncoded type in proto for both ScalarValue message and ArrowType message

Are these changes tested?

Added tests.

Are there any user-facing changes?

New variant for ScalarValue

Protobuf changes to support RunEndEncoded type

@github-actions bot added the sql, common, and proto labels Jan 20, 2026
@Jefffrey changed the title from "Ree scalarvalue" to "Add ScalarValue::RunEndEncoded variant" Jan 20, 2026
Comment on lines +432 to +433
/// (run-ends field, value field, value)
RunEndEncoded(FieldRef, FieldRef, Box<ScalarValue>),
Contributor Author

Mimicking the Arrow type, which stores fields (a run-ends field and a value field).

I initially tried storing only the index DataType and the ScalarValue value, but figured it would be better to be as accurate as possible 🤔
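
To illustrate the design choice above, here is a minimal sketch of a scalar variant that carries both fields so it can reproduce its full data type. `DataType`, `Field`, `FieldRef`, and `example_scalar` are simplified, hypothetical stand-ins written for this example; they are not the real Arrow or DataFusion definitions.

```rust
use std::sync::Arc;

// Hypothetical, simplified stand-ins for Arrow's `DataType`, `Field`, and `FieldRef`.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Int16,
    Int32,
    Utf8,
    RunEndEncoded(FieldRef, FieldRef),
}

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

type FieldRef = Arc<Field>;

#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    Int32(Option<i32>),
    Utf8(Option<String>),
    /// (run-ends field, value field, value)
    RunEndEncoded(FieldRef, FieldRef, Box<ScalarValue>),
}

impl ScalarValue {
    /// Because the variant stores both fields, the data type round-trips
    /// exactly, including field names and nullability.
    fn data_type(&self) -> DataType {
        match self {
            ScalarValue::Int32(_) => DataType::Int32,
            ScalarValue::Utf8(_) => DataType::Utf8,
            ScalarValue::RunEndEncoded(run_ends, values, _) => {
                DataType::RunEndEncoded(Arc::clone(run_ends), Arc::clone(values))
            }
        }
    }
}

/// Builds an illustrative run-end encoded scalar wrapping the string "a".
fn example_scalar() -> ScalarValue {
    let run_ends = Arc::new(Field {
        name: "run_ends".to_string(),
        data_type: DataType::Int16,
        nullable: false,
    });
    let values = Arc::new(Field {
        name: "values".to_string(),
        data_type: DataType::Utf8,
        nullable: true,
    });
    let inner = ScalarValue::Utf8(Some("a".to_string()));
    ScalarValue::RunEndEncoded(run_ends, values, Box::new(inner))
}
```

Storing only the index type would lose the field names and nullability, so the data type derived from the scalar could differ from the array type it came from.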

Comment on lines +1604 to +1605
| DataType::Decimal32(_, _)
| DataType::Decimal64(_, _)
Contributor Author

Little fix since we were missing these


// Unsupported types for now
_ => {
DataType::ListView(_) | DataType::LargeListView(_) => {
Contributor Author

Just getting rid of the catch-all to be more rigorous
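
As a rough illustration of why dropping the catch-all is more rigorous, the sketch below lists the unsupported variants explicitly. The `DataType` enum and `default_literal` function here are hypothetical and trimmed down for the example; they are not the code touched by the PR.

```rust
// Hypothetical, trimmed-down type enum; not Arrow's real `DataType`.
#[derive(Debug, Clone, Copy)]
enum DataType {
    Int32,
    Utf8,
    ListView,
    LargeListView,
}

/// Returns a default literal for supported types, or an error naming the
/// unsupported type.
fn default_literal(data_type: DataType) -> Result<&'static str, String> {
    match data_type {
        DataType::Int32 => Ok("0"),
        DataType::Utf8 => Ok("''"),
        // Unsupported types are listed explicitly instead of using `_ =>`.
        // If a new variant is added to `DataType`, this match stops compiling
        // until someone decides how to handle it, rather than silently
        // falling into the "unsupported" arm.
        DataType::ListView | DataType::LargeListView => Err(format!(
            "default value for {:?} is not supported",
            data_type
        )),
    }
}
```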

_ => unreachable!("Invalid dictionary keys type: {}", key_type),
}
}
DataType::RunEndEncoded(run_ends_field, value_field) => {
Contributor Author

We build the RunArray efficiently here. Unlike the dictionary case above, which would need a hashmap of values to build an efficient dictionary array, a run array only requires tracking where each new run starts.

Most of the verbosity here is related to destructuring input ScalarValues and ensuring we have consistent types from them.
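
The run-tracking idea can be sketched independently of Arrow. The function below is a hypothetical helper, not the PR's implementation: it works on plain vectors, fixes the run-end type to `i32`, and ignores overflow for very long inputs.

```rust
/// Collapses a sequence of values into `(run_ends, values)`, the two children
/// of a run-end encoded array. Each run end is the cumulative (exclusive) end
/// offset of its run.
fn run_end_encode<T: PartialEq>(input: impl IntoIterator<Item = T>) -> (Vec<i32>, Vec<T>) {
    let mut run_ends: Vec<i32> = Vec::new();
    let mut values: Vec<T> = Vec::new();
    let mut len: i32 = 0;

    for item in input {
        len += 1;
        if values.last() == Some(&item) {
            // Same value as the current run: extend the run by moving its end.
            if let Some(end) = run_ends.last_mut() {
                *end = len;
            }
        } else {
            // A new run starts here; no lookup table of earlier values is needed.
            values.push(item);
            run_ends.push(len);
        }
    }
    (run_ends, values)
}
```

Only the most recent value is compared, which is why no hashmap is needed: a value that reappears after a different value simply starts a new run.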

let run_ends = PrimitiveArray::<R>::from_iter_values(run_ends);
let values = ScalarValue::iter_to_array(value_scalars)?;

// Using ArrayDataBuilder so we can maintain the fields
Contributor Author

I think this is the only way to construct a RunArray with the fields we want, since try_new creates the fields for us:

https://github.com/apache/arrow-rs/blob/ddb6c42194fa45516e1bd4a27cdacf10fda56b5a/arrow-array/src/array/run_array.rs#L99-L105

Comment thread datafusion/common/src/scalar/mod.rs
);
let err = scalar.eq_array(&run_array, 1).unwrap_err();
let expected = "Internal error: could not cast array of type Float32 to arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::Float64Type>";
assert!(err.to_string().starts_with(expected));
Contributor Author

Needed to use starts_with since the backtrace feature can affect the error message, so a direct equality check can succeed for cargo test but fail in CI.
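
A small sketch of the failure mode, under the assumption that the error's display text gains a trailing backtrace section when the feature is on. `InternalError`, its fields, and the exact separator are hypothetical and chosen only for illustration.

```rust
use std::fmt;

// Hypothetical error type: `backtrace` stands in for text that is appended
// only when a backtrace feature is enabled.
#[derive(Debug)]
struct InternalError {
    message: String,
    backtrace: Option<String>,
}

impl fmt::Display for InternalError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Internal error: {}", self.message)?;
        if let Some(bt) = &self.backtrace {
            // The trailing section changes the full string but not its prefix.
            write!(f, "\n\nbacktrace: {}", bt)?;
        }
        Ok(())
    }
}

/// Prefix check that holds whether or not a backtrace was appended.
fn has_expected_prefix(err: &InternalError, expected: &str) -> bool {
    err.to_string().starts_with(expected)
}
```

An exact `assert_eq!` on the message would pass only for the variant without the backtrace, while the prefix check passes for both.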

return Err(Error::General(
"Proto serialization error: The RunEndEncoded data type is not yet supported".to_owned()
))
DataType::Decimal32(precision, scale) => {
Contributor Author

Only change here is for RunEndEncoded; for some reason other formatting changes were applied for the other arms here

@Jefffrey Jefffrey marked this pull request as ready for review January 24, 2026 08:38
@alamb added the api change label Jan 27, 2026

alamb commented Jan 27, 2026

FYI @brancz

@alamb alamb left a comment

I won't say I reviewed every line of this PR carefully but I did read them all and they look structurally good to me -- thank you for pushing this along @Jefffrey

Comment thread datafusion/common/src/scalar/mod.rs
}
(Dictionary(_, _), _) => None,
(RunEndEncoded(rf1, vf1, v1), RunEndEncoded(rf2, vf2, v2)) => {
// Don't compare if the run ends fields don't match (it is effectively a different datatype)
Contributor

I'm not sure this is exactly what we want. The run arrays could be logically identical, but their index types might differ. I don't think we'd want the scalar not to equal in that case. I realize that's not what we have for dictionaries either, but is that really the intention of scalars? My understanding has always been that the integer width of codes should be irrelevant from a logical equality perspective.

Contributor Author

I'm not sure we want that logic at this level. For example, if we fix PartialOrd here to compare REE/Dicts based on inner values only, then we'd probably have to do the same for PartialEq, right? But then Hash would be inconsistent unless we also fix Hash 🤔

I think it might be better to leave these as is; if we want proper comparison, it would make more sense to do it at a higher level (e.g. via type coercion).
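
The Hash concern can be made concrete with a toy type. This sketch shows what logical equality would require if it were adopted; it is not what the PR does. `IndexType`, `EncodedScalar`, and `distinct_count` are hypothetical names for this example.

```rust
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

// Hypothetical, simplified scalar: an index type tag plus the inner value.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum IndexType {
    Int16,
    Int32,
}

#[allow(dead_code)]
#[derive(Debug, Clone)]
struct EncodedScalar {
    index_type: IndexType,
    value: i64,
}

// Logical equality: ignore the index type and compare inner values only.
impl PartialEq for EncodedScalar {
    fn eq(&self, other: &Self) -> bool {
        self.value == other.value
    }
}

impl Eq for EncodedScalar {}

// `Hash` must agree with `Eq`: values that compare equal must hash equally,
// so the index type has to be left out of the hash as well.
impl Hash for EncodedScalar {
    fn hash<H: Hasher>(&self, state: &mut H) {
        self.value.hash(state);
    }
}

/// Counts logically distinct scalars using a hash set.
fn distinct_count(scalars: &[EncodedScalar]) -> usize {
    scalars.iter().cloned().collect::<HashSet<_>>().len()
}
```

If `Hash` still included `index_type` while `PartialEq` ignored it, two equal scalars could land in different hash buckets and be counted twice, which is the inconsistency raised above.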

Contributor

I do believe that to be the right path ultimately, but I can agree to starting with this and consistently change things as a follow up.

Contributor

Let's file a follow on ticket.

@alamb alamb added this pull request to the merge queue Feb 2, 2026

alamb commented Feb 2, 2026

Thanks again @Jefffrey and @brancz

Merged via the queue into apache:main with commit 39da29f Feb 2, 2026
28 checks passed
@Jefffrey Jefffrey deleted the ree-scalarvalue branch February 3, 2026 02:37

vegarsti commented Feb 3, 2026

Nice!

de-bgunter pushed a commit to de-bgunter/datafusion that referenced this pull request Mar 24, 2026
## Which issue does this PR close?

- Closes apache#18563

## Rationale for this change

Support RunEndEncoded scalar values, similar to how we support
Dictionary.

## What changes are included in this PR?

- Add new `ScalarValue::RunEndEncoded` enum variant
- Fix `ScalarValue::new_default` to support `Decimal32` and `Decimal64`
- Support RunEndEncoded type in proto for both `ScalarValue` message and
`ArrowType` message

## Are these changes tested?

Added tests.

## Are there any user-facing changes?

New variant for `ScalarValue`

Protobuf changes to support RunEndEncoded type

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
asubiotto pushed a commit to polarsignals/datafusion that referenced this pull request Mar 26, 2026
Labels

api change, common, proto, sql

Development

Successfully merging this pull request may close these issues.

Missing ScalarValue variant for RunEndEncoded

4 participants